Unveiling forensically relevant biogeographic, phenotype and Y-chromosome SNP variation in Pakistani ethnic groups using a customized hybridisation enrichment forensic intelligence panel

Massively parallel sequencing following hybridisation enrichment provides new opportunities to obtain genetic data for various types of forensic testing and has proven successful on modern as well as degraded and ancient DNA. A customisable forensic intelligence panel that targeted 124 SNP markers (67 ancestry informative markers, 23 phenotype markers from the HIrisplex panel, and 35 Y-chromosome SNPs) was used to examine biogeographic ancestry, phenotype and sex and Y-lineage in samples from different ethnic populations of Pakistan including Pothwari, Gilgit, Baloach, Pathan, Kashmiri and Siraiki. Targeted sequencing and computational data analysis pipeline allowed filtering of variants across the targeted loci. Study samples showed an admixture between East Asian and European ancestry. Eye colour was predicted accurately based on the highest p-value giving overall prediction accuracy of 92.8%. Predictions were consistent with reported hair colour for all samples, using the combined highest p-value approach and step-wise model incorporating probability thresholds for light or dark shade. Y-SNPs were successfully recovered only from male samples which indicates the ability of this method to identify biological sex and allow inference of Y-haplogroup. Our results demonstrate practicality of using hybridisation enrichment and MPS to aid in human intelligence gathering and will open many insights into forensic research in South Asia.


Introduction
In forensic investigations, massively parallel sequencing (MPS) with the ability to genotype multiple markers in various biological samples in a single assay with small DNA amount, delivers the potential to enhance human identification and forensic intelligence gathering. It also provides benefits in a number of areas such as admixture analysis, solving complex paternal/maternal cases. leading to an increase in the performance and cost/time-effectiveness of sensitive legal cases [1]. Identification of a person and relatedness between individuals are two of the leading matters in forensic analysis. The potential of single nucleotide polymorphisms (SNPs) to be utilized as genetic markers has made them enormously popular especially in the field of forensic DNA analysis because of various qualities they possess such as automation ability, small fragment length and frequency in the genome [2]. SNPs are more stable genetic markers in most of the sensitive situations such as ancestry cases like inheritance/kinship, provides investigative lead value in cases having no genetic profile match in DNA databases or with no suspect, and in family reconstructions in case of missing individuals and unknown human remains (where the DNA is significantly fragmented). This is because of the fact that they have comparatively low mutation rates [3]. SNP variation in pigmentation genes can also be useful for inferring visible phenotypic traits for example hair, skin and eye colour [4].
For forensic identification purposes targeted enrichment combined with massively parallel sequencing has been explored recently which targets mtDNA and nuclear SNPs [5,6]. Commercial MPS panels using standard PCR-based target enrichment have been developed to genotype many forensically relevant markers [7][8][9]. Hybridisation enrichment, an alternative approach to PCR-based target enrichment prior to sequencing, uses biotinylated probes (complementary to target regions in a DNA sample) to bind to target DNA and has proven successful on modern as well as degraded and ancient DNA [10]. This strategy can enrich for SNP loci prior to sequencing without the need for an initial PCR. Streptavidin beads magnetize to probes bound to target DNA, while unbound DNA and impurities are eliminated through a series of stringency washes. Hybridization enrichment can eliminate some issues with PCR-based approaches, particularly for primer design, and as a result much shorter fragment lengths of DNA can be captured without the need for intact PCR primer binding sites [11]. There is no requirement for complex PCR primer multiplex design for large numbers of markers and thus no limit on how many loci can be examined in a single assay [12].
The aim of the present study is to explore the implementation of emerging target enrichment and massively parallel sequencing technologies to genotype forensically relevant SNPs in samples from different ethnic populations of Pakistan. We used a customized 124-SNP forensic intelligence panel that offered a combined suite of phenotype, biogeographic ancestry and Y-chromosome (Y-chr) SNPs for comprehensive biological profiling.

Materials and methods
A step by step workflow for the experimental lab work is presented in Fig 1.

Collection of study samples and DNA extraction
Blood samples were collected from twenty-eight unrelated healthy male and female individuals belonging to different ethnic populations (Pothwari, Pathan, Baloach, Kashmiri, Gilgit and Siraiki) of Pakistan. Written acceptance was obtained from all donors with approval from the Institutional Bioethics Committee (IBC) No. #BEC-FBS-QAU2018-4. Donors had a selfdeclared ancestry, sex, and different combinations of eye and hair colour. DNA was extracted using a PureLinkTM Genomic DNA kit (Thermo Fisher Scientific Inc., Waltham, Massachusetts, USA) following the manufacturer's protocol.

Library preparation and hybridisation enrichment
Genomic DNA was sheared, converted into truncated Illumina libraries and enriched (via hybridisation to 5' biotinylated 120-mer DNA oligonucleotides (xGen Lockdown Probes) as described by Bardan (2019) [13]. A total of 125 nuclear SNPs (67 ancestry-informative; Table 1, 23 phenotypic; Table 2, and 35 Y-chromosome; Table 3, with one SNP shared between ancestry and phenotype) were include in the bait set. The 124 SNPs provide broad categorisation of continental biogeographic ancestry (African, European, Asian, Native American and Oceanian), major Y-chromosome haplogroups and hair/eye colour prediction, and were developed as a customisable panel for forensic intelligence gathering (Bardan 2019).
Enriched DNA for all 28 samples were combined into a single pool at 5nM concentration prior to paired end sequencing using Illumina MiSeq V2 with read length 2x150 base-pairs (300 cycles).

Sequencing data analysis
After sequencing of samples, reads were filtered according to the standard Illumina protocol at AGRF (Australian Genome Research Facility, Adelaide, Australia) to remove low-quality clusters, and de-multiplex by index. The raw Illumina reads were refined using the PaleoMix v1.0.1 pipeline of Schubert et al. (2014) [29]. Dual internal, P5 and P7 barcodes were used to de-multiplex sequences to each sample. To trim adapters, Adapter removal V2 [30] was used, paired reads were merged and all reads shorter than 25 base-pairs in length were eliminated. Collapsed reads were mapped to the Human Reference Genome hg19 (GRCh37) using version 0.6.2 of BWA (Burrows-Wheeler Aligner) [31]. Seeding option was disabled and a minimum mapping quality of 25 was set. PCR duplicates were eliminated so that only unique reads for genotype calling were retained. To obtain a variant calling (.vcf) file SNPs were called using SAMTools [32] mpileup/bcftools. Genotypes for the targeted SNPs of interest were then isolated by examining against a custom BED file which contains information about genomic coordinates of targeted SNP loci. A workflow summarizing key points of sequencing data analysis process is presented in

DNA phenotyping
For prediction of hair and eye colour in the study samples, 23 SNPs were analyzed using the prediction model from the HIrisPlex [4] DNA Phenotyping web tool. Genotypic data as per the tool's format was prepared in an Excel file and was input into the interface in order to generate probabilities that samples belong to a particular phenotypic class of hair and eye colour. For eye colour, the current prediction framework given by [4] says that the most likely eye colour is indicated by highest (probability) p-value. For hair colour, current interpretation guidelines combine two parameters i.e. highest p-value and shade probability values (either light or dark) to infer the most probable hair colour.

BGA prediction
For biogeographic ancestry (BGA) assignment of each target sample, 67 ancestry informative SNPs from each sample genotype were compared to a reference population data set consisting of genotypes from 368 individuals belonging to different regions i.e. 99 individuals from African population (AFR), 89 from East Asian population (EAS), 88 European (EUR), 64 Native EUR informative SNP at rs16891982 is also included in the phenotype SNPs. Ancestry groups are: East Asian (EAS), African (AFR), EUR-European, Native American (AMR) and Oceanian (OCE). Tri-allelic SNPs are ancestry informative but also serve to monitor for contamination from more than 1 DNA donor.

Sex determination and inference of Y-chromosome haplogroup
All the 35 Y-chromosome SNPs were recovered from all twenty-one male samples. No Y-chr SNPs were called for any of the female samples. Genotype data for all samples has been provided in S1 File. Based on the presence versus absence of Y-chr SNPs all twenty-eight samples were predicted accurately as male or female. Y haplogroup was defined by analyzing SNP data for each sample in which diagnostic ancestral and derived SNPs were observed and assigned in PhyloTree. The output for R1 sample is shown in Fig 3 as an example of the results. In this way haplogroups were assigned to all male samples. Inferred Y-haplogroups reconciled against self-declared lineage for all male samples and results have been summarized in Table 4.

Estimation of externally visible characteristics
From each of the twenty-eight DNA samples, all phenotype SNPs were obtained successfully. The HIrisPlex correctly predicted eye colour for reported blue and brown eye colours as summarized in Table 5. This data shows highest P-values out of all predicted values for colour of eye and hair and for hair shade. Most probable hair colour is the result of combined information of hair colour and shade probability values. Eye colour was predicted accurately for all of the samples based on the highest p-value except R7 and PT32 for which eye colour predicted as blue instead of brown (actual eye colour observed) giving an overall prediction accuracy of 92.8%. Predictions were consistent with reported hair colour for all samples, using the combined highest p-value approach and step-wise model incorporating probability thresholds for light or dark shade.

Assignment of biogeography ancestry
From each of the twenty-eight samples, all 67 biogeographic ancestry SNPs were obtained successfully. All likelihood ratios were at least 1 billion times more likely EUR one population over any of the other four populations, with the exception of K3 and P12 (Table 6). In PCA analysis the first PC1 and second PC2 components respectively observed as 29.64% and 20.18% of the total variance. All four reference population samples form separate clusters, although EAS, AMR and OCE are less clearly separated (Fig 4). The 28 Pakistani samples sit intermediate between the EUR and EAS/AMR/OCE clusters in the PCA (Fig 4) but there is no clear separation between samples from different ethnic groups. Biogeographic ancestry predictions are inconsistent with self-declared ancestry as per Snipper results due to limitation in accurately accounting for admixture by the tool and the absence of SNPs in the panel that can distinguish South Asian ancestry from European or East Asian. Therefore, use of some additional SNPs especially for differentiation of South Asian populations from those to the west and east will help in differentiating between these populations. Moreover, the reference dataset

Discussion
Human identification is a complex process that is important for social and legal reasons. In forensic investigations, MPS can enhance the potential of human identification and help resolve mixture complexities [1]. For SNP typing of samples in forensic investigations, there are many recent MPS approaches that show promise for generating information for multiple markers in a single process [8,9]. The latest hybridisation enrichment strategies for MPS analysis of DNA samples have enhanced opportunities to obtain volumes of genetic data for forensic intelligence and identification purposes [5,6,39]. Predicting physical characteristics from DNA as a biological source termed as forensic DNA phenotyping has gained popularity within forensics due to the potential intelligence information it can provide [40,41]. This facilitates sensitive investigations in which conventional DNA profiling fails or does not provides useful outcomes. There are already developed and forensically authenticated systems consisting of specific markers designated for specific tasks. One example is the IrisPlex system which is a dedicated DNA test for eye colour prediction [42]. Likewise, HIrisPlex as used in the present study combines the SNPs for both eye and hair colour prediction in its system [43]. We analysed samples from different ethnic groups of Pakistan to represent different hair and eye colours. It has been investigated this way that the inclusion of the phenotype SNPs with the ancestry and Y-chr SNPs using a hybridisation enrichment technology gives results that are consistent with known phenotype. Brown and blue eye colours were predicted accurately in all cases in research by [4], however intermediate eye colours remained problematic to predict, giving an overall 83% prediction accuracy of the SNPs to infer eye colour. Interestingly, when excluding the intermediate eye colour category (sometimes explored due to the potential inaccuracies in predicting intermediate eye colour against observed eye colour) [4], the prediction accuracy increases to 92% when grouping individuals into 'brown' and 'not brown' eye colour categories. Given that pigmentation in eye colour is a complex trait which can be subjective to report [44], and that intermediate eye colour has demonstrated a lower prediction accuracy than other eye colours in previous studies [4,

PLOS ONE
Forensic human identification by targeted sequencing 45,46], this result is not unexpected. For samples in the present study, a 100% prediction accuracy was achieved across the twenty-eight samples for hair colour. Predictions were consistent with reported hair colour for all samples, using the combined highest p-value approach and step-wise model incorporating probability thresholds for light or dark shade. Eye colour predicted accurately for all of the samples based on the highest p-value except R7 and PT32 for which eye colour predicted as blue instead of brown (actual eye colour) giving prediction accuracy of 92.8%. Again, previous studies have documented inaccuracies with predicting hair colour phenotypes (down to a 73% prediction accuracy on average), particularly with blond and brown categories [4,46]. For both hair and eye colour, the prediction accuracy shown in this study is consistent with previous error rates established in earlier studies of the HIrisPlex SNP panel [4,43]. Since the design and execution of the panel used in the present research, a latest HIrisPlex panel has been published, called HIrisPlex-S assay which includes additional 17 SNP markers in pigmentation genes which provides additional facilitation of inferring skin colour [47,48]. As a further consideration, these SNPs could easily be incorporated in to the customized enrichment panel as per needs which can serve as a further intelligence tool. Nonetheless, this study has demonstrated the successful use of the HIrisPlex panel in a hybridisation Table 5. Inferred eye colour and most probable hair colour associated probabilities in terms of P-value for twenty-eight samples with known hair and eye colour using the HIrisplex SNPs in the custom enrichment panel.  Table 6. Inferred biogeographic ancestry using snipper for male and female samples under study.

B2
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

B4
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AFR

B5
Asian Asia 10E+9 times more likely to be EUR than AFR and 10E+9 times more likely to be EUR than EAS

B6
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

G9
Asian Asia 10E+9 times more likely to be EUR than AMR and 10E+9 times more likely to be EUR than EAS

Gil7
Asian Asia 10E+9 times more likely to be EUR than AMR and 10E+9 times more likely to be EUR than EAS

Gil8
Asian Asia 10E+9 times more likely to be EUR than AMR and 10E+9 times more likely to be EUR than EAS

Gil9
Asian Asia 10E+9 times more likely to be EUR than AMR and 10E+9 times more likely to be EUR than EAS

Gil11
Asian Asia 10E+9 times more likely to be EUR than AFR and 10E+9 times more likely to be EUR than EAS

K1
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AFR

K3
Asian Asia 2,236 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

K4
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

K7
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

K8
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

P9
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

P11
Asian Asia 10E+9 times more likely to be EUR than AMR and 10E+9 times more likely to be EUR than EAS

P12
Asian Asia 831 times more likely to be EUR than AMR, and 70,915,529 times more likely to be EUR than EAS

P14
Asian Asia 10E+9 times more likely to be EUR than AMR and 10E+9 times more likely to be EUR than EAS

PT32
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

PT34
Asian Asia 10E+9 times more likely to be EUR than AFR and 10E+9 times more likely to be EUR than EAS

PT39
Asian Asia 10E+9 times more likely to be EUR than AMR and 10E+9 times more likely to be EUR than EAS

PT45
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

PT50
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

R1
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

R2
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR

R3
Asian Asia 10E+9 times more likely to be EUR than EAS and 10E+9 times more likely to be EUR than AMR (Continued ) enrichment approach for forensic analysis and may help to further support ancestry estimations when used in conjunction with the ancestry informative SNPs in the custom panel. Currently, the HIrisPlex model includes test data only from European populations [4].
Understanding how different populations may influence the prediction model and therefore the success rate could be improved by including reference samples from multiple non-European populations as from present study. All 67 biogeographic-ancestry SNPs were successfully retrieved from all twenty-eight samples under study. Comparative study for the target sample's SNP genotype data versus available reference population data showed that all likelihood ratios were at least 1 billion times more likely one population over any of the other four populations, with the exception of samples K3 and P12. Use of some additional SNPs especially enlightening for pairwise differentiation of east and south Asia's populations will boost the ability of the panel to differentiate between these populations. Moreover, the reference dataset used for comparison included 89 individuals from EAS population which were JPT: Japanese in Tokyo, hence it is the only representation for EAS group. Inclusion of population genotype data from various countries and ethnic groups of Asia especially Pakistan and neighbouring countries for representation of east and south Asian population groups in the reference dataset could improve final predicted results and clearer biogeography-ancestry estimation. Snipper has also limitation in accurately accounting for admixture, hence it can be concluded that samples under study showed an admixture between EAS and EUR ancestry.
Research has been dedicated for many years on the human Y-chromosome and its variation analysis especially targeting YSNPs. This effort resulted in establishing a well-defined Y chromosome phylogeny. The rise of MPS approaches in recent times is facilitating the discovery of new YSNPs which are in turn increasing resolution to discriminate between closely related Y-haplotypes. The Y-chromosome being haploid and largely non-recombining in nature, is widely used as a marker in many disciplines including forensics research [49,50], exploring structure of Y chromosome [51], and population based studies [52,53]. The Y-SNPs in the custom enrichment panel were able to predict Y-chr haplogroups for all male samples with no conflicting haplogroup classifications. No Y-chr SNP data was recovered from any of the female samples, which also indicates the capability for this method as an indication of sex. For all twenty-one male samples under study, haplogroup classifications and their associated most likely geographic affiliations were reconciled with reported self-declared ancestry. Selfdeclared ancestry and region of samples under study have been affiliated well with inferred one i.e. Asian as all samples belong to local ethnic populations of Pakistan. The panel has successfully determined informative Y-chr haplogroups and sub-haplogroups and can be considered a suitable tool for exploring the paternal lineage of male samples.
Whole genome sequencing is the only method that allows the simultaneous detection of all types of variations within a genome. In a single assay, a wide range of applications can be examined with the downstream analysis providing information about targets that need close examination. But as reads with bad quality were dropped prior to analysis, and whole genome approach yields less coverage in comparison to targeted approach which sequence only loci of interests. Targeted approach identifies those variants that get skipped as a result of whole genome sequencing [54]. It eliminates redundant and unnecessary genetic variations that can lead to distraction from direct interpretation. It is cost and time effective option, especially when a large number of target samples are under study like present research [54].

Conclusion
In an attempt to analyze various marker types together in one analytical workflow for forensic human intelligence information, a novel customisable hybridisation enrichment forensic intelligence panel has been used in the present research which provided new avenues and opened many insights to forensic human identification. This panel facilitates a technical approach that permits the possibility of using customisable SNP marker sets relevant to the question under study for hybridisation enrichment prior to MPS. The panel has distinguished biogeographic ancestry of each study sample between five major continental populations by successfully targeting 67 ancestry informative markers. Y-chr SNP analysis helped in sex determination and assigning haplogroups. Retrieval and analysis of externally visible characteristics (EVCs) such as eyes and hair colour has been achieved by targeting genomes with 23 phenotype markers and HIrisPlex phenotyping tool results match well with previously established success rates. SNPs that are helpful for prediction of more external physical traits, SNPs for biogeography lineage prediction or any additional SNPs that can facilitate in forensic research can be used as individual or in combined customisable panel to facilitate advanced outcomes. An example is recent introduction of HIrisPlex-S system that covers additional 17 SNPs in its panel that facilitates prediction of skin colour along with hair and eyes. The overarching objective of the present research was to explore and use the latest techniques to increase the likelihood of drawing inferences regarding phenotype and lineage from modern human DNA for forensic investigations in Pakistan.